Introduction

This project analyses Boulder B-cycle trip data to understand and document usage patterns from 2013 to early 2016. The analysis relies on summaries and visualizations. The project also applies Machine Learning to the numerical data to try to predict the pass type.

How is the analysis divided?

The analysis is divided into 4 sections:

  1. Summary of Data

  2. Data Visualizations

  3. Machine Learning to predict the Pass Type

  4. Conclusion

1. Summary of Data

##   Rider.Home.System Rider.or.Operator.Number Entry.Pass.Type Bike.Number
## 1   Boulder B-cycle                 R1011535         24-hour         548
## 2   Boulder B-cycle                 R1011722         24-hour         742
## 3   Boulder B-cycle                 R1008367          Annual         578
## 4   Boulder B-cycle                 R1010650         24-hour         616
## 5   Boulder B-cycle                 R1008367          Annual         578
## 6   Boulder B-cycle                 R1055681          Annual         601
##   Checkout.Date Checkout.Day.of.Week Checkout.Time  Checkout.Station
## 1     5/20/2011               Friday    9:24:00 AM      15th & Pearl
## 2     5/20/2011               Friday    9:24:00 AM      15th & Pearl
## 3     5/20/2011               Friday    9:33:00 AM Broadway & Alpine
## 4     5/20/2011               Friday    9:34:00 AM Broadway & Alpine
## 5     5/20/2011               Friday    9:36:00 AM Broadway & Alpine
## 6     5/20/2011               Friday    9:39:00 AM UCAR Center Green
##   Return.Date Return.Day.of.Week Return.Time    Return.Station
## 1   5/20/2011             Friday  9:40:00 AM      26th @ Pearl
## 2   5/20/2011             Friday  9:54:00 AM      15th & Pearl
## 3   5/20/2011             Friday  9:36:00 AM Broadway & Alpine
## 4   5/20/2011             Friday  9:37:00 AM Broadway & Alpine
## 5   5/20/2011             Friday  9:39:00 AM Broadway & Alpine
## 6   5/20/2011             Friday  9:42:00 AM UCAR Center Green
##   Trip.Duration..Minutes.
## 1                      16
## 2                      30
## 3                       3
## 4                       3
## 5                       3
## 6                       3
##                Rider.Home.System  Rider.or.Operator.Number
##  Boulder B-cycle        :243333   M9999957:  9499         
##  Denver B-cycle         :  4666   M9999950:  5684         
##  Madison B-cycle        :   201   M9999952:  5538         
##  Houston B-cycle        :   113   R1028713:  4006         
##  Indy - Pacers Bikeshare:    74   M9999943:  3077         
##  GREENbike              :    38   M9999998:  2835         
##  (Other)                :   119   (Other) :217905         
##            Entry.Pass.Type    Bike.Number       Checkout.Date   
##  24-hour           : 83642   411    :  1821   6/25/2015:   703  
##  7-day             :  5585   584    :  1755   8/2/2015 :   650  
##  Annual            :113041   666    :  1613   8/8/2015 :   639  
##  Maintenance       : 37337   744    :  1608   7/28/2015:   635  
##  Semester (150-day):  8939   665    :  1607   6/26/2015:   621  
##                              699    :  1596   8/5/2015 :   621  
##                              (Other):238544   (Other)  :244675  
##  Checkout.Day.of.Week     Checkout.Time    Checkout.Station  
##  Friday   :39020      12:16:00 PM:   467   Length:248544     
##  Monday   :35182      12:26:00 PM:   455   Class :character  
##  Saturday :36603      12:45:00 PM:   447   Mode  :character  
##  Sunday   :28767      4:12:00 PM :   434                     
##  Thursday :38079      5:05:00 PM :   433                     
##  Tuesday  :34903      12:12:00 PM:   432                     
##  Wednesday:35990      (Other)    :245876                     
##     Return.Date     Return.Day.of.Week      Return.Time    
##  6/25/2015:   706   Friday   :39026    12:04:00 AM:   495  
##  8/2/2015 :   651   Thursday :37881    1:13:00 PM :   451  
##  8/8/2015 :   637   Saturday :36322    12:12:00 PM:   441  
##  7/28/2015:   629   Wednesday:36042    1:51:00 PM :   439  
##  6/26/2015:   624   Monday   :35362    12:15:00 PM:   437  
##  7/11/2015:   624   Tuesday  :34944    12:52:00 PM:   436  
##  (Other)  :244673   (Other)  :28967    (Other)    :245845  
##  Return.Station     Trip.Duration..Minutes.
##  Length:248544      Min.   :    -2.00      
##  Class :character   1st Qu.:     5.00      
##  Mode  :character   Median :    12.00      
##                     Mean   :    63.36      
##                     3rd Qu.:    26.00      
##                     Max.   :181607.00      
## 

The structure of the dataset

There are 248,544 observations of 13 variables. The variables include Checkout/Return Station, Checkout/Return Date and Time, Pass Type, Day of the Week, Trip Duration, Bike Number and Rider/Operator Number. A separate location dataset provides latitude and longitude, along with other information about the Checkout/Return stations.

Corrections to the dataset

There are some errors in the “Rider.Home.System” column. The data is supposed to be for Boulder, but some rows are set to Denver, Houston and other systems, which is not correct. This is not a big issue: the column should be a constant, so it adds no value to the analysis.

NOTE: The Checkout/Return Station “RTD” is really “14th & Canyon” and was entered incorrectly. This error was found late in the analysis, but the correction is applied at the start, so all results below reflect it.
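The station correction described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the code used in the project; `trips` and the helper `fix_station` are hypothetical names, while the column names are taken from the summary output.

```r
# Toy stand-in for the trip data (hypothetical values; real column names).
trips <- data.frame(
  Checkout.Station = c("RTD", "15th & Pearl", "RTD"),
  Return.Station   = c("15th & Pearl", "RTD", "13th & Spruce"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace a mislabelled station name in a character vector.
fix_station <- function(x, wrong = "RTD", right = "14th & Canyon") {
  x[x == wrong] <- right
  x
}

# Apply the same fix to both station columns so checkouts and returns agree.
trips$Checkout.Station <- fix_station(trips$Checkout.Station)
trips$Return.Station   <- fix_station(trips$Return.Station)
```

Applying the fix to both columns before any plotting keeps the station counts consistent between the checkout and return analyses.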

2. Data Visualizations

This section contains many visualizations: a combination of univariate and multivariate plots, focusing on one variable at a time.

Riders or Operators

Fig1: Rider/Operator Count

Fig2: Rider/Operator Count separated by Pass Type

NOTE: Most riders had fewer than 200 rides, so to make the patterns easier to see, the data was subset to riders with 200 or more rides.
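One way to build such a subset is to count rides per rider and keep only the riders at or above the threshold. The sketch below uses a toy data frame; `trips`, `rides_per_rider`, `frequent_ids` and `frequent` are hypothetical names, and the project's actual subsetting code may differ.

```r
# Toy data: two heavy users and one occasional rider (hypothetical IDs).
trips <- data.frame(
  Rider.or.Operator.Number = c(rep("R1", 250), rep("R2", 3), rep("M1", 200)),
  stringsAsFactors = FALSE
)

# Count rides per rider, then keep the IDs with 200 or more rides.
rides_per_rider <- table(trips$Rider.or.Operator.Number)
frequent_ids <- names(rides_per_rider)[rides_per_rider >= 200]

# Keep only the rows belonging to those frequent riders.
frequent <- trips[trips$Rider.or.Operator.Number %in% frequent_ids, , drop = FALSE]
```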

The following trends can be noted from the plots in this section from the subset data:

  1. Some riders use B-cycle very heavily (Fig1). Faceting by pass type (Fig2) shows which passes they prefer. The Annual pass dominates among frequent riders, which is not surprising, but one rider took a little over 200 rides on 24-hour passes, where another pass type would likely have suited them better.

  2. The Maintenance pass stands out: a few users account for a large number of rides. This suggests that these were operators who regularly rode and fixed bikes.

Pass Type

Fig3: Pass Type Count

Fig4: Pass Type Count separated by Day of the Week

  1. There are 5 pass types, as shown in the plot above (Fig3). The Annual pass is clearly the most popular, followed by the 24-hour pass. The Semester (150-day) and 7-day passes pale in comparison. Maintenance also sees relatively high use compared to the 7-day and Semester (150-day) passes.

  2. What stands out in Fig4 is that the 24-hour pass is used far more than the Annual pass on weekends, whereas on weekdays the Annual pass is still the most widely used. Semester and 7-day pass usage is comparatively very low.

Bike Numbers

Fig5: Bike# Count separated by Pass Type

Fig6: Bike# Count separated by Day of the Week

Fig7: Bike# Count separated by Pass Type & Day of the Week

I was not expecting any trends in the bike numbers, but surprisingly there are some.

Except for the 7-day and Semester pass types, the bikes with mid-range numbers seem to be the most used. This might be related to the stations they are kept at, since some stations are more popular than others, as we will see below.

Day of the week

Fig8: Day of the Week Count

Fig9: Day of the Week Count separated by Pass Type

  1. Friday is the most popular day of the week overall, followed by Thursday (surprisingly) and then Saturday. Monday, Tuesday and Wednesday usage is very close, whereas Sunday usage is markedly lower than on other days (Fig8).

  2. Faceted by pass type (Fig9), Annual pass holders mostly ride on weekdays (the distribution across the week is roughly bell-shaped). The opposite holds for 24-hour pass users, who prefer riding on weekends (as noted in the Pass Type section).

  3. Maintenance rides are more common on weekdays.

  4. Semester pass holders mostly use their pass on weekdays, with Tuesday being the most popular day.

  5. The 7-day pass shows no clear trend, but Thursday is the most popular day, followed by Friday and Saturday.
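The day-of-week comparisons above depend on the days appearing in calendar order rather than the alphabetical order seen in the summary output. The following sketch, on toy vectors, shows one way to order the days and cross-tabulate them against pass type; `days`, `pass`, `week_order`, `counts` and `share` are hypothetical names.

```r
# Toy data standing in for Checkout.Day.of.Week and Entry.Pass.Type.
days <- c("Friday", "Monday", "Sunday", "Friday", "Saturday", "Tuesday")
pass <- c("Annual", "Annual", "24-hour", "24-hour", "24-hour", "Annual")

# Put the days in calendar order instead of the default alphabetical order.
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
days <- factor(days, levels = week_order)

# Rides per pass type and day, then each pass type's share by day (rows sum to 1).
counts <- table(pass, days)
share <- prop.table(counts, margin = 1)
```

Comparing row shares rather than raw counts makes it easier to see the weekday versus weekend split for pass types with very different total volumes.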

Trip Duration

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     -2.00      5.00     12.00     63.36     26.00 181600.00
Fig10: Box plots of Trip Duration

Fig11: Trip Duration Distribution

Fig12: Trip Duration Distribution separated by Pass Type

Fig13: Trip Duration Distribution separated by Day of the Week

Fig14: Trip Duration Distribution separated by Pass Type and Day of the Week

This section also uses a subset of the data. Trip duration has many outliers, so the data was restricted to rides of 60 minutes or less.
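A minimal sketch of that restriction is shown below on toy values that include the extremes from the summary (a minimum of -2 and a maximum of 181607 minutes). The names `trips`, `within_hour` and `non_negative` are hypothetical. Dropping negative durations as well is an assumption added here for illustration; the source only states the 60-minute upper bound.

```r
# Toy durations, including the extreme values reported in the summary.
trips <- data.frame(Trip.Duration..Minutes. = c(-2, 3, 12, 26, 60, 61, 181607))

# Restriction described in the text: rides of 60 minutes or less.
within_hour <- trips[trips$Trip.Duration..Minutes. <= 60, , drop = FALSE]

# Assumption (not stated in the source): also drop negative durations.
non_negative <- within_hour[within_hour$Trip.Duration..Minutes. >= 0, , drop = FALSE]

# The outliers pull the mean far above the median, as in the full dataset.
mean_all   <- mean(trips$Trip.Duration..Minutes.)
median_all <- median(trips$Trip.Duration..Minutes.)
```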

  1. Overall (Fig11), the trip duration distribution is right-skewed: it peaks at 4-6 minutes, falls off steeply, and levels out from about the 32-minute mark.

  2. Faceted by pass type (Fig12), the most common trip duration is 12-13 minutes for the 24-hour pass, 10-13 minutes for the 7-day pass, 4-6 minutes for Annual, 1 minute for Maintenance (quick maintenance rides!) and 6-8 minutes for the Semester pass.

  3. By day of the week (Fig13), the 4-6 minute range is still the most common on weekdays but not on weekends, where 10-12 minutes seems more common. This may be because people are in less of a hurry on weekends.

  4. Combining pass type and day of the week (Fig14), the trip duration pattern for Annual pass holders does not change much; point 1 still holds. For the 24-hour pass, trip durations tend to fall in the upper range, at 10+ minutes. The 7-day pass shows no clear pattern in the plots. A 1-minute ride remains the most common Maintenance turnaround time, and 6-8 minutes is still the most common trip duration for the Semester pass.

Checkout Station

Fig15: Checkout Station Count

Fig16: Checkout Station Count separated by Day of the Week

Fig17: Checkout Station Count separated by Pass Type

Fig18: Trip Duration Distribution separated by Checkout Station

  1. 15th & Pearl and 13th and Spruce are the 2 most popular checkout stations in Boulder (Fig15). 11th and Pearl and Municipal Building are nearly tied. Greenhouse and Gunbarrel North are the least used stations. 14th and Walnut office might be an error, as this location has no latitude or longitude listed.

  2. Faceted by day of the week (Fig16), 15th & Pearl is still the most popular checkout station. 13th and Spruce and Municipal Building are the next most popular from Mon-Thu, and 11th & Pearl from Fri-Sun.

  3. By pass type (Fig17), 15th & Pearl is the most popular checkout station for every pass type except the Semester pass. For the 24-hour pass, 11th and Pearl is the 2nd most popular checkout station, followed by 19th @ Broadway. The Village seems to be the 2nd most popular station for the 7-day pass. The distribution for the Annual pass closely matches the overall pattern noted in point 1, because this is the most popular pass type.

  4. The spikes in maintenance (Fig16 & Fig17) at locations like The Village and 26th @ Pearl are worth noting, as they are out of line with the overall checkout station popularity pattern. This might indicate that the bikes at those stations were subject to rougher use, or that a batch of bikes had a few defects.

  5. Faceting trip duration by checkout station (Fig18) reveals no major surprises. The overall pattern still seems to hold across the popular stations, with ride times in the 6-10 minute range.

Return Station

Fig19: Return Station Count separated by Pass Type

Fig20: Return Station Count separated by Day of the Week

Fig21: Return Station Count separated by Pass Type

Fig22: Trip Duration Distribution separated by Return Station

There are no surprises in the Return Station analysis; most if not all of the points from the previous section apply here as well.

Checkout Date

Fig23: Checkout Date Distribution

Fig24: Checkout Date Distribution separated by Pass Type

Fig25: Checkout Date Distribution separated by Day of the Week

Fig26: Checkout Date Distribution separated by Pass Type and Day of the Week

  1. The number of checkouts has increased steadily over the years from 2013 to 2016 (Fig23). There is a clear seasonal pattern: checkouts rise in the summer months (May-August) and dip on either side of them. This makes sense, as people tend to ride less in the winter months. Among the summer months, July and August have the most checkouts across the years.

  2. Viewed by pass type (Fig24), all pass types have seen an increase in usage since Boulder B-cycle was introduced. The 7-day pass saw a big increase in the summer of 2015, and the Semester pass has also grown substantially since it was introduced in early 2014.

  3. Among Annual pass holders, October 2015 (Fig24) saw more use than any of the warmer months. This is surprising; I guess October must have been warm, or there must have been a lot of events in the Boulder area that month.

  4. Maintenance generally follows the same seasonal trend, with more instances in the summer months and fewer in the colder months. One anomaly is that April 2015 had the most maintenance instances of that year, although it was not the most popular month for ridership. This might indicate that Boulder B-cycle was preparing in advance for the busy summer months. This seems plausible because maintenance was lower in the months following April 2015.

  5. When checkouts are divided by day of the week (Fig25), only Tuesday and Wednesday deviate from the general trend that August is the most popular month, followed by July. For Tuesday and Wednesday the roles of July and August are reversed.

  6. The multivariate analysis (Fig26) shows finer trends in popular days across months and pass types, but it adds no new points beyond the ones already documented.
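The monthly trends above require turning the `Checkout.Date` strings (for example `6/25/2015`) into dates and grouping them by month. A minimal sketch, assuming the month/day/year format seen in the data preview; `checkout`, `dates`, `month_key` and `monthly` are hypothetical names.

```r
# Toy dates in the same month/day/year format as Checkout.Date.
checkout <- c("5/20/2011", "6/25/2015", "6/26/2015", "8/2/2015")

# Parse to Date; the format string matches e.g. "6/25/2015".
dates <- as.Date(checkout, format = "%m/%d/%Y")

# Build a sortable year-month key and count checkouts per month.
month_key <- format(dates, "%Y-%m")
monthly <- table(month_key)
```

A `YYYY-MM` key sorts chronologically as plain text, which keeps the months in order on a plot axis without extra factor handling.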

Checkout/Return Time(Part 1)

Fig27: Checkout Time Distribution

Fig28: Return Time Distribution

Fig29: Checkout Time Distribution separated by Pass Type

Fig30: Return Time Distribution separated by Pass Type

  1. Checkouts/returns start slowly a little before 7:00 (Fig27), followed by a big increase just after 7:00. From then on they decrease slightly, increase again from 10:00 to 11:15, dip at around 13:00, and increase until 15:00. There is another dip, followed by an increase in ridership after 18:00.

  2. The return and checkout times follow each other closely because the most common ride duration in Boulder is less than 10 minutes.

  3. For 24-hour pass holders (Fig29/30), checkouts/returns start off strong in the morning and slowly decrease, except for one spike at 10:00. They start increasing again at around 14:30, peak at 18:00, and then slowly decrease.

  4. Among 7-day pass holders (Fig29/30), checkouts/returns start increasing at around 11:30 and peak at 14:30. This pattern also holds for Semester pass holders.

  5. Annual pass usage follows the overall pattern described in point 1. For Maintenance, the peak is in the morning before 11:00, followed by a big dip and then a big increase after 15:00.
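The time-of-day patterns above use a 24-hour clock, while `Checkout.Time` and `Return.Time` are stored as 12-hour strings such as `9:24:00 AM`. The sketch below converts such strings to an hour of day with plain string handling, which avoids locale-dependent AM/PM parsing. It is an illustration only; `times`, `hour12`, `is_pm`, `hour24` and `per_hour` are hypothetical names.

```r
# Toy times in the same 12-hour format as Checkout.Time.
times <- c("9:24:00 AM", "12:16:00 PM", "12:04:00 AM", "5:05:00 PM")

# Leading hour on the 12-hour clock (1-12) and the AM/PM flag.
hour12 <- as.integer(sub(":.*$", "", times))
is_pm  <- grepl("PM$", times)

# Convert to 0-23: 12 AM becomes 0 and 12 PM stays 12.
hour24 <- (hour12 %% 12L) + ifelse(is_pm, 12L, 0L)

# Checkouts per hour, keeping empty hours so the full day is visible.
per_hour <- table(factor(hour24, levels = 0:23))
```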

Checkout/Return Time(Part 2)

Fig31: Checkout Time Distribution separated by Day of the Week

Fig32: Return Time Distribution separated by Day of the Week

Fig33: Checkout Time Distribution separated by Pass Type & Day of the Week

Fig34: Return Time Distribution separated by Pass Type & Day of the Week

  1. Among 24-hour pass holders (Fig31/32), the checkout/return pattern from Tue-Thu differs from Fri-Mon. Tue-Thu checkouts/returns do not dip as much in the middle of the day as they do from Fri-Mon.

  2. 7-day pass checkouts/returns (Fig31/32) peak at around 15:00 from Mon-Thu, while Fri-Sun see peaks and drops throughout the day. The same pattern applies to Semester pass holders.

  3. Annual pass holders tend to use the service at around 11:00, 15:00 and 18:00 on weekdays, whereas on Saturdays and Sundays usage peaks in the mornings and evenings.

  4. Maintenance peaks at 15:00 on weekdays, and in the mornings and evenings on Saturdays and Sundays (with a big drop in the middle of the day).

Heat Map of Stations

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Boulder,+Colorado&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Boulder,%20Colorado&sensor=false
Fig35: Heat Map of Checkout Stations

Fig36: Heat Map of Return Stations

The size of each circle represents the total number of checkouts/returns at that station since B-cycle started. The two maps (Fig35 & Fig36) make it clear that the downtown stations are the most frequently used. The stations near downtown and in or near the University come next in terms of usage.

3. Machine Learning to predict the Pass Type

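# Load caret for createDataPartition(), train(), resamples() and confusionMatrix()
library(caret)

# Subset the data, keeping only the pass type (column 3) and trip duration (column 13)
mlsubset <- dataset[c(3, 13)]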

# Create partition
trainIndex <- createDataPartition(mlsubset$Entry.Pass.Type, p = 0.8, list = FALSE, times = 1)
trainingset <- mlsubset[trainIndex, ]
testset <- mlsubset[-trainIndex, ]

# 10-fold cross validation: train on 9 folds, test on the remaining 1, for all combinations
trainControl <- trainControl(method = "cv", number = 10)
# Compare the models on classification accuracy
metric <- "Accuracy"

# Evaluate 3 different algorithms, make sure the same seed is used
# Linear Discriminant Analysis
set.seed(7)
fit.lda <- train(Entry.Pass.Type~., data = trainingset, method = "lda", 
                 metric = metric, trControl = trainControl)
## Loading required package: MASS
# Classification and Regression Tree
set.seed(7)
fit.cart <- train(Entry.Pass.Type~., data = trainingset, method = "rpart", 
                  metric = metric, trControl = trainControl)
## Loading required package: rpart
# Naive Bayes
set.seed(7)
fit.nb <- train(Entry.Pass.Type~., data = trainingset, method = "nb", 
                metric = metric, trControl = trainControl)
## Loading required package: klaR
# Summarize accuracy of models
results <- resamples(list(lda = fit.lda, cart = fit.cart, nb = fit.nb))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, cart, nb 
## Number of resamples: 10 
## 
## Accuracy 
##        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## lda  0.4551  0.4552 0.4553 0.4554  0.4555 0.4557    0
## cart 0.6099  0.6123 0.6129 0.6134  0.6155 0.6167    0
## nb   0.5481  0.5549 0.5580 0.5568  0.5589 0.5613    0
## 
## Kappa 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## lda  0.0007871 0.001297 0.001471 0.001569 0.001746 0.002487    0
## cart 0.3663000 0.370100 0.372800 0.372800 0.376200 0.377200    0
## nb   0.2117000 0.219100 0.224500 0.223000 0.226200 0.232500    0
# Dot plot of the results
dotplot(results)

# Compare against the test set
predictions <- predict(fit.cart, testset)
confusionMatrix(predictions, testset$Entry.Pass.Type)
## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           24-hour 7-day Annual Maintenance Semester (150-day)
##   24-hour              10553   428   4175        2578                371
##   7-day                    0     0      0           0                  0
##   Annual                5518   653  17300        2356               1333
##   Maintenance            657    36   1133        2533                 83
##   Semester (150-day)       0     0      0           0                  0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6113         
##                  95% CI : (0.607, 0.6156)
##     No Information Rate : 0.4548         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3685         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 24-hour Class: 7-day Class: Annual
## Sensitivity                  0.6309      0.00000        0.7652
## Specificity                  0.7710      1.00000        0.6361
## Pos Pred Value               0.5829          NaN        0.6370
## Neg Pred Value               0.8046      0.97753        0.7646
## Prevalence                   0.3365      0.02247        0.4548
## Detection Rate               0.2123      0.00000        0.3480
## Detection Prevalence         0.3642      0.00000        0.5464
## Balanced Accuracy            0.7009      0.50000        0.7007
##                      Class: Maintenance Class: Semester (150-day)
## Sensitivity                     0.33923                   0.00000
## Specificity                     0.95481                   1.00000
## Pos Pred Value                  0.57024                       NaN
## Neg Pred Value                  0.89100                   0.96405
## Prevalence                      0.15022                   0.03595
## Detection Rate                  0.05096                   0.00000
## Detection Prevalence            0.08936                   0.00000
## Balanced Accuracy               0.64702                   0.50000

This section explored whether Machine Learning algorithms could predict the pass type with high accuracy from the only numeric variable, trip duration. Three algorithms were tested (LDA, CART and Naive Bayes). Of the three, CART (better known as a Decision Tree) had the best results, but its accuracy was still low: about 61% in both cross-validation and on the test set, against a no-information rate of about 45%. The model also never predicted the 7-day or Semester (150-day) classes. With trip duration as the only predictor, the algorithms had little to work with.

If Boulder B-cycle had provided another numeric variable, such as the distance covered during each trip, it would probably have improved the classification accuracy.
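The headline statistics can be recomputed by hand from the confusion matrix printed above, which helps show where the 61% figure comes from. The sketch below copies the counts from that output into a base R matrix; the variable names (`classes`, `cm`, `accuracy`, `nir`, `sensitivity`) are chosen here for illustration.

```r
# Confusion matrix counts copied from the output above (rows = predictions).
classes <- c("24-hour", "7-day", "Annual", "Maintenance", "Semester (150-day)")
cm <- matrix(
  c(10553,  428,  4175, 2578,  371,
        0,    0,     0,    0,    0,
     5518,  653, 17300, 2356, 1333,
      657,   36,  1133, 2533,   83,
        0,    0,     0,    0,    0),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Accuracy: correctly classified trips (the diagonal) over all test trips.
accuracy <- sum(diag(cm)) / sum(cm)

# No-information rate: accuracy of always predicting the largest class (Annual).
nir <- max(colSums(cm)) / sum(cm)

# Per-class sensitivity: correct predictions over actual members of each class.
sensitivity <- diag(cm) / colSums(cm)
```

The zero sensitivities for the 7-day and Semester classes reflect the two empty prediction rows: with one predictor, the tree never assigns a trip to either of the small classes.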

Conclusion

Many observations were made in this document; listed here are the major findings from the dataset.